Stat Rules
Stat rules are a powerful method for accessing profile statistics. These are typically sub-second operations that do not require scanning or iterating. There are several cases where SQL struggles to support rules, such as: isNull but not "null count" or nullRatio or nullPercent. Or having access to types without doing crazy cast() operations.
For example, here is a simple stat rule that breaks when the symbol column contains less than 800 rows of data that are not strings:
select * from @dataset where
fname.$type != 'String' AND $rowCount < 800
Dataset-level stat rules
Provides observability into your data at the dataset level, enabling you to detect issues with the entire dataset.
Column-level stat rules
Provides observability into your data at the column level.
Column-level Stat Rule | Rule Example | Description of Rule Example |
---|---|---|
.$type | symbol.$type != 'String' | Breaks when a row in the symbol column is not a string. |
.$min | symbol.$min > 'apple' | Breaks when a row in the symbol column is greater than a given value. $min performs a lexicographical sort that works for strings and numbers. |
.$minNum | high.$minNum > 13 |
Breaks when a row in the age column is greater than 13. The row is type-casted to a numeric for simple number checks. |
.$mean | row_id.$mean > '4.500' | Breaks when the mean is greater than a given value. |
.$max | symbol.$max > 'apple' | Breaks when a row in the symbol column is greater than a given value. |
.$maxNum | high.$maxNum > 13 | Breaks when the numeric value falls outside an acceptable range. |
.$uniqueCount | id.$uniqueCount != $rowCount | Breaks when the uniqueCount of a field doesn't match the rowCount. |
.$uniqueRatio | gender.$uniqueRatio between .4 and .6 | Breaks when the ratio of uniqueCounts of a given field doesn't match the rowCount. |
.$nullRatio | lname.$nullRatio not between .4 and .6 | Breaks when the ratio of nulls no longer falls within acceptable range. |
.$nullPercent | lname.$nullPercent not between 40 and 60 | Breaks when the percent of nulls no longer falls within acceptable range |
.$nullCount | symbol.$nullCount >= 1 | Test for a single null. |
.$emptyRatio | open.$emptyRatio > 0.2 | Breaks when the ratio of empties no longer falls within acceptable range. |
.$emptyPercent | open.$emptyPercent > 20 | Breaks when the percent of empties no longer falls within an acceptable range. |
.$emptyCount | open.$emptyCount > 100 | Breaks when the emptyCounts of a field no longer fall within an acceptable range. |
.$mixedTypeRatio | open.$mixedTypeRatio > 0.2 | Breaks when the ratio of mixed data types no longer falls within an acceptable range. For example, Strings and Ints in the same field. |
.$mixedTypePercent | open.$mixedTypeRatio > 20 |
Breaks when the percent of mixed data types no longer falls within an acceptable range. |
.$mixedTypeCount | open.$mixedTypeCount >= 1 | Breaks when the mixed data typeCount no longer falls within an acceptable range. For example, Strings and Ints in the same field. |
Distribution Rule
There is a common case in DQ where you want to know the distribution of a column's value. Consider gender. It can be expected that a column named gender consists of roughly 40-60% males and roughly 40-60% females if the data set is large and represents the population. This can be difficult to express in plain SQL, but it is very easy with the below syntax.
gender['Male'].$uniquePercent between 40 and 60
Column Value-level | Rule |
---|---|
.$uniqueCount | credit_rating['FAIR'].$uniqueCount > 7 |
.$uniquePercent | credit_rating['GOOD'].uniquePercent between 40 and 60 |
Known Limitations
- The $daysWithoutData stat rule has a limitation where rules that are not named "daysWithoutData" do not display their Pulse View charts when you drill into the rule on the Findings page. A possible workaround for this is to name your stat rule exactly as it appears here: daysWithoutData.
- You cannot combine stat rules or distribution rules with regex rules in the same rule. For example,
car_vin rlike '$[asdf][0-9]' and car_vin.$uniqueCount